OVERVIEW:

This repository contains the replication files for the article "Power in Text: Implementing Networks and Institutional Complexity in American Law." The repository contains four basic parts:

1. Code to download and process the underlying text files for the paper.
2. Training information and fit models for the named entity recognition portion of the paper.
3. Predictions from the named entity recognition model, as well as code to convert those outputs into a network representation.
4. Bayesian modeling code, model outputs, and plotting code for the results in the body and appendix of the paper. 

For conciseness, I've organized a short summary of each of the scripts used in this paper in this document, along with the order in which the scripts should be run if a researcher wishes to replicate any particular step of this analysis. See the documentation in the individual scripts for further information. Run the scripts in the order listed, since later scripts rely on the outputs of earlier ones.

FILE STRUCTURE:
 - code.tar.gz
  - Folder of scripts that scrape/process data, format outputs from the NER models, and produce the final dataset used in regression models.
 - data.tar.gz
  - Data folder containing auxiliary data, raw legislative texts, and formatted legislative texts for use with the NER models.
 - tf_ner
  - data 
   - Training/testing data for NER models, processed with GloVe indices (not included)
  - models
   - Fit NER models, plus code and processing scripts
 - out.tar.gz
  - Outputs from code.tar.gz, including the final dataset used for Bayesian modeling and individual graphs for each law.
 - modeling.tar.gz
  - Bayesian modeling code, as well as fit models and summary figures from the text of the paper. 
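
To work with the archives listed above, they first need to be unpacked. The following is a minimal sketch using only the Python standard library. The archive names come from the listing above; extracting everything into the repository root is an assumption, not something the scripts are known to require.

```python
import tarfile
from pathlib import Path

# Archive names are taken from the file listing above. The target directory
# (the repository root by default) is an assumption.
ARCHIVES = ["code.tar.gz", "data.tar.gz", "out.tar.gz", "modeling.tar.gz"]


def extract_archives(root=".", archives=ARCHIVES):
    """Extract each archive found under `root`; return the names extracted."""
    root = Path(root)
    extracted = []
    for name in archives:
        path = root / name
        if not path.exists():
            # Skip archives that are not present rather than failing.
            continue
        with tarfile.open(path, "r:gz") as tar:
            tar.extractall(root)
        extracted.append(name)
    return extracted
```

Calling extract_archives() from the repository root unpacks whichever of the four archives are present and reports which ones were found.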

DATA COLLECTION:
1. Cornell_LII_Law_Names.ipynb
 - Webscraper, which collects all "common names" of laws from the Cornell LII website.
 - These names are filtered out of law texts in Hitchhiker_Data_Collection.ipynb, to avoid false positive named entity mentions.
 - Results are saved as "data/common_names.json". 
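
The filtering step described above can be sketched roughly as follows. This is an illustration, not the code from the notebook: it assumes that common_names.json holds a flat list of strings and that names are removed by exact string match. Both function names are hypothetical.

```python
import json
import re


def load_common_names(path="data/common_names.json"):
    """Load the scraped names. Assumes the file holds a flat JSON list of strings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def strip_common_names(text, names):
    """Remove every exact occurrence of each name from `text`."""
    # Try longer names first, so a short name never clips a longer one.
    ordered = sorted((n for n in names if n), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    stripped = pattern.sub("", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r" {2,}", " ", stripped).strip()
```

Matching longest names first matters when one common name contains another, for example "Clean Air Act" and "Clean Air Act of 1970".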

2. Hitchhiker_Data_Collection.ipynb
 - Webscraper, which relies on hitchhiker_data_casas_denny_wilkerson.csv (replication materials from Casas et al., 2019).
 - This script uses URLs from the Casas et al. (2019) data to scrape law texts from congress.gov.
 - Next, using the "insertion" information from Casas et al. (2019), this script removes the text of "inserted" bills from their destination documents.
 - The script also does some simple preprocessing steps, e.g. removing law common names, removing tables of contents, etc.
 - The output here is one file for each law, which contains text at various preprocessing stages as well as metadata from Casas et al. (2019).
 - Results are saved as "data/legislative_texts.tar.gz".

3a. NER_Training_Set.ipynb
 - Processing script, which formats the text data from Hitchhiker_Data_Collection.ipynb into the correct format for the NER prediction model.
 - Also splits the data into "folds" appropriate for cross-validation.
 - Note that the splitting in this dataset happens at the level of the pre-built entity dictionary (agency_list.txt), *not* at the level of the sentence.
 - Results are saved in "data/NER_train_test.tar.gz".
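
To illustrate what splitting at the level of the entity dictionary means, here is a minimal sketch. It is not the notebook's code: the number of folds, the random seed, the substring matching, and the rule that a sentence mentioning any held-out entity goes to the test set are all assumptions made for the example.

```python
import random


def entity_folds(entities, k=5, seed=0):
    """Shuffle the entity names and deal them into k disjoint folds."""
    names = sorted(set(entities))
    random.Random(seed).shuffle(names)
    return [names[i::k] for i in range(k)]


def split_sentences(sentences, held_out):
    """Split sentences into (train, test) for one fold.

    A sentence goes to the test set if it mentions any held-out entity.
    Plain substring matching is an assumption made for this sketch.
    """
    held_out = set(held_out)
    train, test = [], []
    for sentence in sentences:
        if any(name in sentence for name in held_out):
            test.append(sentence)
        else:
            train.append(sentence)
    return train, test
```

Under this rule, no held-out entity name appears in the training sentences for its fold, which is the practical difference from shuffling sentences directly.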

3b. NER_Prediction_Set.ipynb
 - Processing script, which formats the text data from Hitchhiker_Data_Collection.ipynb into the correct format for the NER prediction model.
 - Unlike NER_Training_Set.ipynb, this script splits sentences with no entities from the rest of the dataset, which we'll predict on later.
 - Results are saved in "data/NER_train_test.tar.gz".

4. Entity_Extraction.ipynb
 - Network construction script, which takes the outputs from the sequence_tagging LSTM-CRF model and extracts named entity graphs from the output.
 - Outputs are saved as "data/out.csv" (for the per-observation summary statistics) and "data/graphs.tar.gz" (for the graph data for each document).
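
As a rough picture of this step, the sketch below turns tagged sentences into a weighted edge list. It is an illustration under stated assumptions, not the notebook's method: it assumes B-/I-/O style tags, uses a hypothetical "ORG" label, and links two entities whenever they appear in the same sentence. The actual edge definition is documented in Entity_Extraction.ipynb.

```python
from collections import Counter
from itertools import combinations


def extract_entities(tokens, tags):
    """Collect entity strings from parallel token/tag lists in B-/I-/O format."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            # "O" tags (and stray "I-" tags with no open entity) end the span.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def cooccurrence_edges(tagged_sentences):
    """Count undirected edges between entities that share a sentence."""
    edges = Counter()
    for tokens, tags in tagged_sentences:
        found = sorted(set(extract_entities(tokens, tags)))
        for a, b in combinations(found, 2):
            edges[(a, b)] += 1
    return edges
```

Each key of the returned counter is an alphabetically ordered pair of entity names, so the same pair is never counted under two different keys.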

5. CQ_Scraper.ipynb
 - Webscraper, which searches CQ Press's "year-end" summary of legislative activity to check whether a law number was mentioned in that summary.
 - This data is then appended to the dataset produced in the previous file.
 - Output is saved as "data/out_cq_added.csv".
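
The mention check could look something like the sketch below. The citation formats it accepts ("P.L. 110-343", "PL 110-343", "Public Law 110-343") are assumptions about how law numbers appear in the summaries, and mentions_law is a hypothetical name, not a function from the notebook.

```python
import re


def mentions_law(summary_text, congress, law_number):
    """Return True if `summary_text` cites Public Law <congress>-<law_number>."""
    pattern = re.compile(
        # Accepted prefixes: "PL", "P.L.", "Pub. L.", "Public Law" (assumed formats).
        r"\b(?:P\.?\s?L\.?|Pub(?:lic)?\.?\s+L(?:aw)?\.?)\s*"
        + rf"{int(congress)}\s*-\s*{int(law_number)}\b",
        flags=re.IGNORECASE,
    )
    return pattern.search(summary_text) is not None
```

The trailing word boundary keeps a search for law 110-343 from matching a longer number such as 110-3430.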

NAMED ENTITY RECOGNITION:
 - Relies on the sequence_tagging LSTM-CRF implementation. 
 - Model results and input data are saved in compressed .tar.gz files in the tf_ner folder. 
 - Uncompress and run individual scripts within each folder to fit models, predict, and assess fit.
 - To replicate CV results reported in-text, fit each CV fold, note fit statistics (printed on command line), and average across folds.
 - See package GitHub page (linked below) for instructions and further documentation. 
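
Once the fit statistics for each fold have been noted, the averaging step is simple arithmetic. The helper below is a small sketch. The function name and the dictionary keys used to label statistics are placeholders; use whatever names the model prints on the command line.

```python
from statistics import mean


def average_fold_stats(fold_stats):
    """Average each statistic across folds.

    `fold_stats` holds one dict per fold, all with the same keys,
    with values copied by hand from the command-line output.
    """
    if not fold_stats:
        raise ValueError("need at least one fold")
    keys = fold_stats[0].keys()
    return {key: mean(fold[key] for fold in fold_stats) for key in keys}
```

Pass one dictionary per fold, in any order; the result has one averaged value per statistic.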

ANALYSIS:
1. model_fitting.R
 - Model-fitting script, which processes data produced by the previous set of scripts and fits all in-text and appendix models.

2. plotting.R
 - Plotting script, which produces all figures in-text and in the appendix.

REQUIRED DATA FILES AND LIBRARIES:
1. agency_list.txt
 - List of government entity names (one name per line).
 - Entity names were scraped, de-duplicated, and then manually reviewed/cleaned.
 - Based on a list of names scraped in 2018 from:
  - https://www.usa.gov/federal-agencies
  - https://www.federalregister.gov/agencies
  - https://en.wikipedia.org/wiki/List_of_federal_agencies_in_the_United_States
  - https://en.wikipedia.org/wiki/List_of_defunct_or_renamed_United_States_federal_agencies
  - https://en.wikipedia.org/wiki/List_of_current_United_States_House_of_Representatives_committees
  - https://en.wikipedia.org/wiki/List_of_current_United_States_Senate_committees
  - https://en.wikipedia.org/wiki/Category:Joint_committees_of_the_United_States_Congress

2. hitchhiker_data_casas_denny_wilkerson.csv
 - Replication data from Casas et al. (2019).
 - Used for bill- and member-level metadata and links to legislative texts, as well as information about "hitchhiker" status of bills.
 - Source URL: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7ZVSYO

3. constitute_tools
 - Python legal document parsing library. Breaks legal texts into a hierarchical representation, given a set of header regular expressions.
 - https://github.com/rbshaffer/constitute_tools

4. sequence_tagging
 - Python LSTM-CRF library, for general-purpose named entity recognition.
 - The version included with replication materials has some small modifications to deal with path navigation, but no substantive changes.
 - Trained using pre-built GloVe vectors (not included in replication materials).
 - https://github.com/guillaumegenthial/sequence_tagging
